Each year, a small number of FBS college football teams use some version of the Option Offense. This project tests the common claim that prior experience with the Option improves defensive performance against it. We examine play-by-play data and find evidence to support the claim. The results suggest that defenses with little prior experience can expect to give up over a touchdown more per game against Option Offenses than their highly experienced counterparts.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy.sparse import lil_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
pd.options.mode.chained_assignment = None
The NCAA Division I Football Bowl Subdivison (FBS) is the highest division of college football in the United States. It currently contains 133 teams who typically play 12 regular season games per year, plus a conference championship and / or bowl game if they qualify.1 At least 11 of those games must be against fellow FBS schools.2 Each team represents its own university. Universities with FBS teams can be found in almost every state in the USA, and include large public universities, small private schools, schools with religious affiliations, and military academies.
Image('img/map.png')
FBS schools closest to each US county, 2021.3
Many FBS teams use fairly standard offensive schemes similar to the ones used in the NFL. Because the FBS is so large, though, and because its member teams come from such a wide variety of circumstances, some of them have unique play styles. Many teams face severe limitations, especially when it comes to recruiting high school players. Some selective FBS universities hold members of their football team to (relatively) strict academic standards, so those teams are only able to recruit high school players with strong academic records. The FBS military academies (Army, Air Force, and Navy) face the most severe limitations: they can only recruit players who are willing to remain in the military after graduation. (Star players for the military academies are occasionally granted waivers to try their luck in the NFL, though they remain members of the military.4)
If these teams used the same strategies as teams without their limitations, they would likely be unable to compete due to the talent deficit. Consequently, many of them alter their offensive strategy to try to level the playing field. The military academies, among a few other teams, run versions of the Flexbone Option offense, also known as the Triple Option or Veer.5 Unlike most modern offenses, they gain a majority of their yards by running the ball as opposed to passing. Because Option Offenses are so different from normal FBS offenses, their opponents typically expend a lot of effort preparing to face them. For instance, Southern Methodist University used 2 out of their 15 spring practices in 2018 preparing to defend Navy that fall.6
If opponents have to specially prepare for Option-like offenses, we might expect teams to perform better against those offenses as they gain more experience against them. Indeed, this is a common truism among college football fans and analysts. In this project, we put that claim to the test: does the data show that prior experience with Option Offenses improves a defense’s performance against them? If so, how large is the effect?
The data used to address this question was drawn from https://collegefootballdata.com/, a community-driven project that maintains APIs for data sets on college football teams, players, coaches, and seasons. For the purpose of this analysis, 4 APIs were accessed.
# FBS teams by year
team_df = pd.read_csv('data/fbs_teams_2004_2023.csv')
team_df
| id | school | season | |
|---|---|---|---|
| 0 | 2005 | Air Force | 2004 |
| 1 | 2006 | Akron | 2004 |
| 2 | 333 | Alabama | 2004 |
| 3 | 12 | Arizona | 2004 |
| 4 | 9 | Arizona State | 2004 |
| ... | ... | ... | ... |
| 2498 | 98 | Western Kentucky | 2023 |
| 2499 | 2711 | Western Michigan | 2023 |
| 2500 | 277 | West Virginia | 2023 |
| 2501 | 275 | Wisconsin | 2023 |
| 2502 | 2751 | Wyoming | 2023 |
2503 rows × 3 columns
# Game information
games_df = pd.read_csv('data/game_info_2004_2023.csv', low_memory=False)
games_df[['id','season','week','home_team','away_team','neutral_site','home_division','away_division']]
| id | season | week | home_team | away_team | neutral_site | home_division | away_division | |
|---|---|---|---|---|---|---|---|---|
| 0 | 242410193 | 2004 | 1 | Miami (OH) | Indiana State | False | fbs | fcs |
| 1 | 242410259 | 2004 | 1 | Virginia Tech | USC | False | fbs | fbs |
| 2 | 242462199 | 2004 | 2 | Eastern Michigan | Buffalo | False | fbs | fbs |
| 3 | 242460254 | 2004 | 2 | Utah | Texas A&M | False | fbs | fbs |
| 4 | 242462711 | 2004 | 2 | Western Michigan | UT Martin | False | fbs | fcs |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15747 | 401539484 | 2023 | 14 | Troy | Appalachian State | False | fbs | fbs |
| 15748 | 401539483 | 2023 | 14 | Alabama | Georgia | True | fbs | fbs |
| 15749 | 401539478 | 2023 | 14 | Florida State | Louisville | True | fbs | fbs |
| 15750 | 401539480 | 2023 | 14 | Iowa | Michigan | True | fbs | fbs |
| 15751 | 401520445 | 2023 | 15 | Army | Navy | True | fbs | fbs |
15752 rows × 8 columns
first_year = 2006 # 2006 for 2-year experience window
last_year = 2023
year_list = list(range(first_year, last_year+1))
year_list.remove(2020)
Since cfb_data_PBP_api.py stores each season individually, we concatenate the data from each year into a single large dataframe.
# Play-by-play
PBP_df_list = []
for year in year_list:
PBP_df_year = pd.read_csv('data/game_PBP_{}.csv'.format(year), low_memory=False)
PBP_df_year['season'] = year
PBP_df_list.append(PBP_df_year)
PBP_df = pd.concat(PBP_df_list).reset_index(drop=True)
PBP_df[['game_id','offense','defense','home','away','play_type','ppa']]
| game_id | offense | defense | home | away | play_type | ppa | |
|---|---|---|---|---|---|---|---|
| 0 | 262430009 | Arizona State | Northern Arizona | Arizona State | Northern Arizona | Pass | -1.225679 |
| 1 | 262430009 | Arizona State | Northern Arizona | Arizona State | Northern Arizona | Pass | -0.472584 |
| 2 | 262430009 | Arizona State | Northern Arizona | Arizona State | Northern Arizona | Punt | NaN |
| 3 | 262430009 | Arizona State | Northern Arizona | Arizona State | Northern Arizona | Pass | 3.197009 |
| 4 | 262430009 | Arizona State | Northern Arizona | Arizona State | Northern Arizona | Pass | 0.400748 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2728110 | 401618179 | Albany | South Dakota State | South Dakota State | Albany | Rush | 0.229173 |
| 2728111 | 401618179 | Albany | South Dakota State | South Dakota State | Albany | Pass Incompletion | -0.101343 |
| 2728112 | 401618179 | South Dakota State | Albany | South Dakota State | Albany | Rush | 0.722464 |
| 2728113 | 401618179 | South Dakota State | Albany | South Dakota State | Albany | Rush | -0.388112 |
| 2728114 | 401618179 | South Dakota State | Albany | South Dakota State | Albany | Rush | -0.105599 |
2728115 rows × 7 columns
# Season team statistics
season_df = pd.read_csv('data/team_season_stats_2004_2023.csv')
season_df
| season | team | conference | stat_name | stat_value | |
|---|---|---|---|---|---|
| 0 | 2004 | Oregon State | Pac-10 | passCompletions | 287 |
| 1 | 2004 | Texas Tech | Big 12 | rushingTDs | 23 |
| 2 | 2004 | Western Michigan | Mid-American | fumblesLost | 13 |
| 3 | 2004 | Minnesota | Big Ten | puntReturnTDs | 0 |
| 4 | 2004 | New Mexico State | Sun Belt | passCompletions | 209 |
| ... | ... | ... | ... | ... | ... |
| 75381 | 2023 | Coastal Carolina | Sun Belt | sacks | 20 |
| 75382 | 2023 | Purdue | Big Ten | kickReturns | 22 |
| 75383 | 2023 | UMass | FBS Independents | passingTDs | 14 |
| 75384 | 2023 | Eastern Michigan | Mid-American | thirdDowns | 180 |
| 75385 | 2023 | Western Kentucky | Conference USA | firstDowns | 263 |
75386 rows × 5 columns
We pivot the season statistics on stat_name to get each statistic as its own column. We also drop any rows with null rushingAttempts values, since this is the most important statistic for our purposes.
season_df = pd.pivot_table(season_df, index=['team','season'],
values='stat_value', columns='stat_name')
season_df = season_df.dropna(subset='rushingAttempts')
season_df = season_df.reset_index()
season_df
| stat_name | team | season | firstDowns | fourthDownConversions | fourthDowns | fumblesLost | fumblesRecovered | games | interceptionTDs | interceptionYards | ... | puntReturns | rushingAttempts | rushingTDs | rushingYards | sacks | tacklesForLoss | thirdDownConversions | thirdDowns | totalYards | turnovers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Air Force | 2004 | 261.0 | 7.0 | 12.0 | 8.0 | 8.0 | 11.0 | 0.0 | 102.0 | ... | 23.0 | 648.0 | 31.0 | 3051.0 | NaN | NaN | 78.0 | 166.0 | 4680.0 | 16.0 |
| 1 | Air Force | 2005 | 244.0 | 17.0 | 23.0 | 14.0 | 8.0 | 11.0 | 0.0 | 91.0 | ... | 23.0 | 588.0 | 28.0 | 2712.0 | NaN | NaN | 59.0 | 149.0 | 4590.0 | 24.0 |
| 2 | Air Force | 2006 | 241.0 | 12.0 | 22.0 | 10.0 | 11.0 | 12.0 | 0.0 | 67.0 | ... | 18.0 | 660.0 | 21.0 | 2752.0 | NaN | NaN | 95.0 | 174.0 | 3967.0 | 14.0 |
| 3 | Air Force | 2007 | 277.0 | 17.0 | 25.0 | 13.0 | 15.0 | 13.0 | 1.0 | 68.0 | ... | 24.0 | 721.0 | 36.0 | 3894.0 | NaN | NaN | 79.0 | 188.0 | 5452.0 | 18.0 |
| 4 | Air Force | 2008 | 240.0 | 13.0 | 24.0 | 11.0 | 21.0 | 13.0 | 1.0 | 90.0 | ... | 36.0 | 777.0 | 27.0 | 3470.0 | NaN | NaN | 87.0 | 203.0 | 4538.0 | 17.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2495 | Wyoming | 2019 | 230.0 | 6.0 | 12.0 | 6.0 | 12.0 | 13.0 | 2.0 | 199.0 | ... | 34.0 | 575.0 | 27.0 | 2792.0 | 44.0 | 104.0 | 81.0 | 188.0 | 4562.0 | 15.0 |
| 2496 | Wyoming | 2020 | 111.0 | 6.0 | 12.0 | 4.0 | 6.0 | 6.0 | 0.0 | 60.0 | ... | 12.0 | 275.0 | 16.0 | 1317.0 | 17.0 | 46.0 | 27.0 | 89.0 | 2237.0 | 9.0 |
| 2497 | Wyoming | 2021 | 241.0 | 7.0 | 20.0 | 6.0 | 8.0 | 13.0 | 3.0 | 138.0 | ... | 12.0 | 543.0 | 24.0 | 2752.0 | 24.0 | 57.0 | 75.0 | 173.0 | 4867.0 | 18.0 |
| 2498 | Wyoming | 2022 | 208.0 | 2.0 | 7.0 | 3.0 | 12.0 | 13.0 | 1.0 | 54.0 | ... | 7.0 | 479.0 | 16.0 | 2358.0 | 37.0 | 66.0 | 61.0 | 175.0 | 4077.0 | 15.0 |
| 2499 | Wyoming | 2023 | 242.0 | 6.0 | 9.0 | 5.0 | 11.0 | 13.0 | 1.0 | 120.0 | ... | 9.0 | 468.0 | 19.0 | 2060.0 | 20.0 | 51.0 | 56.0 | 152.0 | 4251.0 | 11.0 |
2500 rows × 34 columns
In order to perform a data-driven analysis, some key terms need to be operationalized. First, we have to define the set of offenses we're interested in. This presents some difficulties. As mentioned, the military academies are prototypical examples of teams that use Option Offenses, but they aren’t the only ones. Further, there are many types of offense that might be given the "Option" label, and concepts from Option Offenses have found their way into more normal offenses.
The most important trait common to all Option Offenses is that they rush the ball much more than most teams. We can leverage this fact to overcome the difficulties with defining "Option Offense". Using season statistics from 2004-2023, we calculate the percentage of a team's total offensive plays that were rushing attempts:
$$(1) \quad \text{rushing_pct} = \frac{\text{rushingAttempts}}{\text{rushingAttempts} + \text{passAttempts}}.$$season_df['rushing_pct'] = season_df['rushingAttempts']/(season_df['rushingAttempts'] + season_df['passAttempts'])
The resulting variable is displayed in the histogram below. Blue bars represent all FBS teams (2004-2023) and orange bars represent the military academies. The dashed line is placed at the 97th rushing_pct percentile. While the distribution is mostly symmetrical around ~0.55, there's a heavy right tail above the 97th percentile (represented by the dashed line). These teams rushed the ball on at least 76.4% of their offensive plays. Almost all of the military academy teams from 2004-2023 are found in this right tail.
option_academy_season_df = season_df[ # season stats for military academies
((season_df['team']=='Air Force')&(season_df['season'].isin(year_list+[2004,2005,2020])))|(
(season_df['team']=='Army')&(season_df['season'].isin(year_list+[2004,2005,2020])))|(
(season_df['team']=='Navy')&(season_df['season'].isin(year_list+[2004,2005,2020])))]
ax = sns.histplot(binwidth=0.025, binrange=(0,1), data=season_df['rushing_pct'])
ax2 = sns.histplot(binwidth=0.025, binrange=(0,1), color='orange',
data=option_academy_season_df['rushing_pct'])
ax2.axvline(x=season_df['rushing_pct'].quantile(0.97), color='gray', ls='--')
ax2.set(xlim=(0,1), xticks=[i/10 for i in range(1,11)])
import matplotlib.patches as mpatches
handles=[mpatches.Patch(color='b', label='All FBS Teams'),
mpatches.Patch(color='orange', label='Military Academies')]
plt.title("FBS Rushing Percentage 2004-2023")
plt.legend(handles=handles,framealpha=0.5)
plt.show()
'97th percentile, rushing_pct: ' + str(round(season_df['rushing_pct'].quantile(0.97), 3))
'97th percentile, rushing_pct: 0.764'
season_df['rushing_off'] = 0
season_df.loc[season_df['rushing_pct'] >= season_df['rushing_pct'].quantile(0.97), 'rushing_off'] = 1
rush_df = season_df[season_df['rushing_off']==1][['team','season']]
'# of Rushing Offenses: ' + str(len(rush_df))
'# of Rushing Offenses: 75'
The remainder of the analysis will focus on "Rushing Offenses", where a Rushing Offense is an offense with a season rushing_pct at or above the 97th percentile.
"Rushing Offense" is a good, precise proxy for the rather nebulous and difficult-to-define "Option Offense" category. Some teams might have a high rushing_pct because of unusual circumstances during a specific season: maybe their quarterbacks get injured or they have especially great running backs. Schools with teams running Option offenses, on the other hand, will consistently have a high rushing_pct across multiple seasons: they run the Option because it's the offense their coach runs and / or they face institutional limitations.
We can see this with the 75 Rushing Offenses in the sample when they're broken down by season. Besides Rice at the beginning of the timeline, schools with Rushing Offenses have them across multiple seasons.
rush_df_display = pd.DataFrame(columns=sorted(rush_df['season'].unique()), index=rush_df['team'].unique())
for team in rush_df_display.index:
rush_df_team = rush_df[rush_df['team']==team]
for season in rush_df_display.columns:
rush_df_display.loc[team, season] = ''
if len(rush_df_team[rush_df_team['season']==season]) > 0:
rush_df_display.loc[team, season] = 'X'
rush_df_display
| 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Air Force | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |
| Army | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | ||||
| Georgia Southern | X | X | X | X | X | X | ||||||||||||||
| Georgia Tech | X | X | X | X | X | X | X | X | X | X | ||||||||||
| Navy | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |
| New Mexico | X | X | X | X | ||||||||||||||||
| Rice | X |
There are some gaps (Air Force 2005, Georgia Southern 2016, Georgia Tech 2015, New Mexico 2015) because we set the cutoff at the 97th percentile. If the cutoff was lowered to the 96th percentile, each of these teams would also be included, but so would many one-off teams that don't really meet the criteria we're looking for.
With a set of Rushing Offenses in hand, we indicate whether each game includes a Rushing Offense for the home team and/or away team.
games_df = games_df.merge(rush_df,
left_on=['home_team','season'],
right_on=['team','season'],
how='left', indicator='home_rushing_team')
games_df['home_rushing_team'] = games_df['home_rushing_team'].eq('both')
games_df = games_df.drop(columns='team')
games_df = games_df.merge(rush_df,
left_on=['away_team','season'],
right_on=['team','season'],
how='left', indicator='away_rushing_team')
games_df['away_rushing_team'] = games_df['away_rushing_team'].eq('both')
games_df = games_df.drop(columns='team')
The previous section refined our question a bit: does prior experience defending Rushing Offenses improves a defense’s performance against them? We now need to get clear on what we mean by experience against Rushing Offenses.
We can distinguish two ways a defense gains experience against Rushing Offenses:
If a defense has recently played against a Rushing Offense, the players and coaches who were with the team at the time should be able to build on their previous preparation when getting ready to face more Rushing Offenses. They also should be able to learn from those previous games to improve their tactics. This can build up institutional knowledge about how to prepare for and play against Rushing Offenses that helps the players and coaches who weren't around for those previous games.
We define a defense's Game Experience Score (GES) heading into each game. For a given game and defense, the defense's Game Experience Score is the number of games the defense has played against Rushing Offenses in the 2 prior seasons and the current season before the current game. A 2-season cutoff for Game Experience seems reasonable because we can reasonably expect that many important players (Juniors and Seniors) and coaches were with the team for the games in that window.
Below, we go through each game and calculate the GES for the home and away teams. The printed example shows Notre Dame's GES in games against Rushing Teams during the 2009-2011 seasons. Going into the 2011 Week 6 matchup against Air Force, ND had a GES of 3 because of the 2009 Navy, 2010 Navy, and 2010 Army games. Going into the 2011 Week 9 matchup against Navy, ND had a GES of 4 because of the 3 games in 2009-2010 plus the Air Force game earlier that season.
n_years = 2 # number of years back to count Game Experience
games_df['home_experience'] = 0
games_df['away_experience'] = 0
for i, row in games_df[games_df['season'].isin(year_list)].iterrows():
home = row['home_team']
away = row['away_team']
season = row['season']
week = row['week']
games_df1 = games_df[(games_df['season']<=season)&(games_df['season']>=season-n_years)]
# Input home team Game Experience
home_experience = len(games_df1[((games_df1['season']<season)|(
(games_df1['season']==season)&(games_df1['week']<week)))&(((games_df1['home_team']==home)&(
games_df1['away_rushing_team']==True))|((games_df1['away_team']==home)&(
games_df1['home_rushing_team']==True)))])
games_df.loc[i, 'home_experience'] = home_experience
# Input away team Game Experience
away_experience = len(games_df1[((games_df1['season']<season)|(
(games_df1['season']==season)&(games_df1['week']<week)))&(((games_df1['home_team']==away)&(
games_df1['away_rushing_team']==True))|((games_df1['away_team']==away)&(
games_df1['home_rushing_team']==True)))])
games_df.loc[i, 'away_experience'] = away_experience
# Print example
if (home=='Notre Dame')&(season==2011)&(week==9):
print(games_df.loc[games_df1[(games_df1['season']<=season)&(((games_df1['home_team']==home)&(
games_df1['away_rushing_team']==True))|((games_df1['away_team']==home)&(
games_df1['home_rushing_team']==True)))].index,
['home_team','away_team','season','week','home_experience','away_experience']])
home_team away_team season week home_experience away_experience 4242 Notre Dame Navy 2009 10 3 4 4891 Navy Notre Dame 2010 8 5 2 5140 Notre Dame Army 2010 12 3 5 5578 Notre Dame Air Force 2011 6 3 5 5733 Notre Dame Navy 2011 9 4 5
Before discussing the second way a defense gains experience against Rushing Offenses, we'll do some more preprocessing. Using the GES as a measure of experience forces us to truncate the dataset. Since we only have team season statistics going back to 2004, we can only calculate the GES starting in 2006, since this calculation requires us to find Rushing Offenses on the schedule in the previous two years. As a result, game and PBP data from 2004 and 2005 are discarded for the final analysis. We also discard data from the pandemic-disrupted 2020 season.
games_df = games_df[games_df['season'].isin(year_list)] # year_list = [2006, ..., 2023] without 2020.
team_df = team_df[team_df['season'].isin(year_list)]
Next, we discard game and PBP data from games that featured non-FBS teams. These games are generally uncompetitive, data on non-FBS teams is less reliable, and including them would just complicate the analysis to little benefit.
for year in year_list: # non-FBS away teams
team_list = team_df.loc[team_df['season']==year, 'school'].to_list()
games_df_year = games_df[games_df['season']==year]
PBP_df_year = PBP_df[PBP_df['season']==year]
games_df = games_df.drop(games_df_year[~games_df_year['away_team'].isin(team_list)].index)
PBP_df = PBP_df.drop(PBP_df_year[~PBP_df_year['away'].isin(team_list)].index)
games_df = games_df.reset_index(drop=True)
PBP_df = PBP_df.reset_index(drop=True)
for year in year_list: # non-FBS home teams
team_list = team_df.loc[team_df['season']==year, 'school'].to_list()
games_df_year = games_df[games_df['season']==year]
PBP_df_year = PBP_df[PBP_df['season']==year]
games_df = games_df.drop(games_df_year[~games_df_year['home_team'].isin(team_list)].index)
PBP_df = PBP_df.drop(PBP_df_year[~PBP_df_year['home'].isin(team_list)].index)
games_df = games_df.reset_index(drop=True)
PBP_df = PBP_df.reset_index(drop=True)
It will be convenient to create a team key with a school name / season string.
team_df['school'] = team_df['school'] + team_df['season'].astype(str)
games_df['home_team'] = games_df['home_team'] + games_df['season'].astype(str)
games_df['away_team'] = games_df['away_team'] + games_df['season'].astype(str)
PBP_df['home'] = PBP_df['home'] + PBP_df['season'].astype(str)
PBP_df['away'] = PBP_df['away'] + PBP_df['season'].astype(str)
PBP_df['offense'] = PBP_df['offense'] + PBP_df['season'].astype(str)
PBP_df['defense'] = PBP_df['defense'] + PBP_df['season'].astype(str)
season_df['team'] = season_df['team'] + season_df['season'].astype(str)
season_df = season_df.set_index('team')
rushing_teams = list(season_df.loc[(season_df['season'].isin(year_list))&(season_df['rushing_off']==1)].index)
We add a variable to the PBP data indicating whether the offense for a given play is a Rushing Offense.
PBP_df = PBP_df.join(season_df[['rushing_off']], on='offense')
Finally, we add the Game Experience Scores from the game data to the PBP data and create a variable specifically for the defense's GES on a given play.
PBP_df = PBP_df.merge(games_df[['id','home_experience','away_experience']],
left_on='game_id', right_on='id', suffixes=(None, "_temp"))
PBP_df['defense_GES'] = (PBP_df['away']==PBP_df['offense'])*PBP_df['home_experience'] + (
PBP_df['home']==PBP_df['offense'])*PBP_df['away_experience']
For plays with a Rushing Offense, the most common defense GES is 2, followed by 1 and 0. A GES above 5 is rare. For plays without a Rushing Offense, on the other hand, the most common defense GES is 0 by a wide margin.
temp = PBP_df[['defense_GES','rushing_off','away']].pivot_table(
index='defense_GES', columns='rushing_off', aggfunc='count', fill_value=0)
temp.columns = temp.columns.droplevel()
temp/temp.sum()
| rushing_off | 0 | 1 |
|---|---|---|
| defense_GES | ||
| 0 | 0.610052 | 0.191036 |
| 1 | 0.158930 | 0.225195 |
| 2 | 0.129425 | 0.298514 |
| 3 | 0.061790 | 0.110498 |
| 4 | 0.024961 | 0.094330 |
| 5 | 0.010869 | 0.056649 |
| 6 | 0.002758 | 0.017756 |
| 7 | 0.000667 | 0.004923 |
| 8 | 0.000364 | 0.001101 |
| 9 | 0.000185 | 0.000000 |
If a team has a Rushing Offense, we'd expect that team's defense to be better prepared to face other Rushing Offenses. We add a variable to the PBP data indicating whether the defense for a given play has a Rushing Offense on its team.
PBP_df['rushing_team_def'] = 0
PBP_df.loc[PBP_df['defense'].isin(rushing_teams), 'rushing_team_def'] = 1
Of the plays with a Rushing Offense, 13.2% were defended by a defense with another Rushing Offense on its team.
temp = PBP_df[['rushing_team_def', 'rushing_off', 'away']].pivot_table(
index='rushing_team_def', columns='rushing_off', aggfunc='count', fill_value=0)
temp.columns = temp.columns.droplevel()
temp/temp.sum()
| rushing_off | 0 | 1 |
|---|---|---|
| rushing_team_def | ||
| 0 | 0.974888 | 0.868412 |
| 1 | 0.025112 | 0.131588 |
We have precise definitions of Rushing Offenses and experience against them. The last ingredient we need is a measure of performance. The PBP data contains a ppa variable that's a good candidate. PPA, short for Predicted (or Expected) Points Added, is a measure of how a given play is expected to affect the scoring margin.7
A positive PPA indicates a good play for the offense. For instance, in the play below from 2006, Texas Tech faced a 4th and 13 against SMU and rushed for 16 yards, gaining the first down and then some. This was a great result for the offense, a fact that's reflected in the high PPA of 3.39.
PBP_df.loc[PBP_df['id']==262452641031, ['offense','defense', 'yard_line', 'down',
'distance', 'yards_gained', 'play_type','ppa']]
| offense | defense | yard_line | down | distance | yards_gained | play_type | ppa | |
|---|---|---|---|---|---|---|---|---|
| 6000 | Texas Tech2006 | SMU2006 | 64 | 4 | 13 | 16 | Rush | 3.386074 |
Conversely, a negative PPA indicates a good play for the defense. In the play below from 2007, Arkansas had a 1st and 10 against South Carolina and rushed for only 1 yard. This was a pretty good result for the defense, which is reflected in the somewhat low PPA of -0.82.
PBP_df.loc[PBP_df['id']==273070008173, ['offense','defense', 'yard_line', 'down',
'distance', 'yards_gained', 'play_type','ppa']]
| offense | defense | yard_line | down | distance | yards_gained | play_type | ppa | |
|---|---|---|---|---|---|---|---|---|
| 180009 | Arkansas2007 | South Carolina2007 | 53 | 1 | 10 | 1 | Rush | -0.81639 |
Using PPA as our performance measure, the goal is now to test whether a defense's prior experience against Rushing Offenses has an impact on PPA when playing against them.
We drop plays with null PPA as well as special teams plays like punts and kickoffs, leaving us with 1,582,844 plays in the dataset. The mean PPA is 0.127 with a standard deviation of 1.226. Rushing Offenses average 0.177 PPA, which is somewhat higher than 0.125 PPA for regular offenses.
PBP_df = PBP_df.dropna(subset=['ppa'])
drop_play_types = ['Punt Return Touchdown','Blocked Punt Touchdown','Blocked Field Goal Touchdown',
'Kickoff Return Touchdown','placeholder','Field Goal Good']
PBP_df = PBP_df[~PBP_df['play_type'].isin(drop_play_types)]
PBP_df = PBP_df.reset_index(drop=True)
pd.DataFrame(PBP_df['ppa'].describe())
| ppa | |
|---|---|
| count | 1.582844e+06 |
| mean | 1.265052e-01 |
| std | 1.225548e+00 |
| min | -1.278676e+01 |
| 25% | -5.835822e-01 |
| 50% | -1.308110e-01 |
| 75% | 6.980426e-01 |
| max | 7.382809e+00 |
PBP_df.groupby('rushing_off')['ppa'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| rushing_off | ||||||||
| 0 | 1534981.0 | 0.124945 | 1.227521 | -12.786765 | -0.585961 | -0.131937 | 0.698478 | 7.382809 |
| 1 | 47863.0 | 0.176547 | 1.159376 | -11.947510 | -0.459526 | -0.094911 | 0.678252 | 7.337203 |
We analyze the effect of prior experience on performance against Rushing Offenses using a linear regression model. We'll control for offensive strength, defensive strength, and the effect of home-field advantage. The model assumes that PPA on a given play can be estimated as a linear combination of the following variables:
We're primarily interested in the estimates of the coefficients on the two interaction terms. If prior experience against Rushing Offenses improves performance against them, one or both of these coefficients should be negative (since lower PPA indicates better defensive performance).
We add them to the PBP data.
PBP_df['defense_GES_x_rushing_off'] = PBP_df['defense_GES']*PBP_df['rushing_off']
PBP_df['rushing_team_def_x_rushing_off'] = PBP_df['rushing_team_def']*PBP_df['rushing_off']
We also add the hfa variable to the PBP data indicating whether the offense, defense, or neither has the home-field advantage.
(The home-field advantage and dummy variable code blocks were adapted from a blog post by Bud Davis at 8.)
neutralGames = games_df['id'][games_df['neutral_site']==True].to_list()
PBP_df['hfa'] = None # homefield advantage
PBP_df.loc[PBP_df['home'] == PBP_df['offense'], 'hfa'] = 1 # home team on offense
PBP_df.loc[PBP_df['home'] == PBP_df['defense'], 'hfa'] = -1 # away team on offense
PBP_df.loc[PBP_df['game_id'].isin(neutralGames), 'hfa'] = 0 # neutral site games
We isolate the variables needed to construct the model.
numeric_vars = ['hfa', 'defense_GES_x_rushing_off']
boolean_vars = ['rushing_team_def_x_rushing_off']
dummy_vars = []
total_vars = ['offense', 'defense'] + numeric_vars + boolean_vars + dummy_vars
df = PBP_df[['offense', 'defense'] + boolean_vars + numeric_vars + dummy_vars + ['ppa']]
To estimate offensive and defensive strength, we create dummy variables for each offense and defense. Offenses with higher coefficients are stronger, while defenses with lower coefficients are stronger.
dummies_df = pd.get_dummies(df[total_vars], columns=['offense', 'defense'], sparse=True)
dummies_df
| hfa | defense_GES_x_rushing_off | rushing_team_def_x_rushing_off | offense_Air Force2006 | offense_Air Force2007 | offense_Air Force2008 | offense_Air Force2009 | offense_Air Force2010 | offense_Air Force2011 | offense_Air Force2012 | ... | defense_Wyoming2013 | defense_Wyoming2014 | defense_Wyoming2015 | defense_Wyoming2016 | defense_Wyoming2017 | defense_Wyoming2018 | defense_Wyoming2019 | defense_Wyoming2021 | defense_Wyoming2022 | defense_Wyoming2023 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | -1 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 1 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 1 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 1 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1582839 | 0 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1582840 | 0 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1582841 | 0 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1582842 | 0 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1582843 | 0 | 0 | 0 | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1582844 rows × 4275 columns
The resulting matrix is huge (1,582,844 x 4275) but is also sparse, with most entries equal to zero. We use scipy's sparse matrix tools to make the calculations more tractable, populating a lil_matrix with the values from dummies_df before converting it to CSR format.9
def dataframe_to_csr(df, numeric_vars):
    # Build the sparse matrix one column at a time in LIL format (cheap to
    # populate incrementally), then convert to CSR for fast arithmetic
    arr = lil_matrix(df.shape, dtype=np.int8)
    for i, var in enumerate(df.columns):
        ix = df[var] != 0
        if var in numeric_vars:  # numeric variables keep their values
            arr[np.where(ix), i] = df.loc[ix, var]
        else:  # boolean / dummy variables become 1
            arr[np.where(ix), i] = 1
    return arr.tocsr()
dummies_df[numeric_vars] = dummies_df[numeric_vars].astype(pd.SparseDtype("int8", 0))
dummies_df[boolean_vars] = dummies_df[boolean_vars].astype(pd.SparseDtype("bool", False))
arr = dataframe_to_csr(dummies_df, numeric_vars)
We use sklearn to fit a linear regression model to the data before displaying the important results.
reg = linear_model.LinearRegression(fit_intercept=True).fit(X=arr, y=df['ppa'])
# Extract regression coefficients
reg_results_df = pd.DataFrame({
'coef_name': dummies_df.columns.values,
'reg_coef': reg.coef_})
reg_results_df.loc[len(reg_results_df.index)] = ['Intercept', reg.intercept_]
reg_results_df.loc[reg_results_df['coef_name'].isin([
'hfa', 'defense_GES_x_rushing_off',
'rushing_team_def_x_rushing_off', 'Intercept'])].round(4)
| | coef_name | reg_coef |
|---|---|---|
| 0 | hfa | 0.0108 |
| 1 | defense_GES_x_rushing_off | -0.0171 |
| 2 | rushing_team_def_x_rushing_off | -0.0879 |
| 4275 | Intercept | 0.1233 |
As expected, the coefficients on both experience variables, Defense GES x Rushing Offense and Rushing Team Defense x Rushing Offense, are negative. On plays against a Rushing Offense, the expected PPA falls by about 0.0171 for each additional point of the defense's GES and by about 0.0879 if the defense's own team runs a Rushing Offense.
It's reasonable to expect a Rushing Offense to get about 65 plays per game,10 so over the course of an entire game these amount to a decrease of about 1.11 PPA for each additional GES and 5.71 PPA for Own-Team Experience.
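As a quick sanity check on those per-game figures, here's the arithmetic (using the rough 65 plays-per-game estimate cited above):

```python
plays_per_game = 65  # rough plays-per-game estimate for a Rushing Offense

# Per-play coefficients from the regression above
ges_coef = -0.0171       # Defense GES x Rushing Offense
own_team_coef = -0.0879  # Rushing Team Defense x Rushing Offense

print(plays_per_game * ges_coef)       # about -1.11 PPA per game per GES point
print(plays_per_game * own_team_coef)  # about -5.71 PPA per game
```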
To get a sense of how robust this result is, we should construct confidence intervals on these estimates. Bootstrap sampling is a good method here because it doesn't assume that the model errors follow any particular distribution. Standard methods of constructing confidence intervals on model parameters assume normally distributed errors, but we can see that this assumption doesn't hold for our model.
y_hat = reg.predict(arr)
res = df['ppa'] - y_hat
df['y_hat'] = y_hat
df['res'] = res
ax = sns.histplot(data=df, x='res')
ax.set_title('Error Histogram')
ax.set_xlabel('Error')
plt.show()
plt.close()
Bootstrap sampling resamples the dataset (with replacement) a number of times. For each sample, we fit the model and keep track of the resulting parameter estimates. The bootstrap estimates of a parameter $\beta$ can give us a confidence interval on that parameter. To get a 95% confidence interval on $\beta$, we find the 0.025 and 0.975 percentiles of the bootstrap sample estimates of $\beta$.
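The percentile method is simple enough to sketch in a few lines. In this minimal example, the bootstrap estimates are made-up values standing in for the real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bootstrap estimates of a coefficient beta (illustrative only)
boot_betas = rng.normal(loc=-0.017, scale=0.005, size=1000)

# 95% percentile bootstrap interval: the 0.025 and 0.975 quantiles
low, high = np.quantile(boot_betas, [0.025, 0.975])
print(f"95% bootstrap interval: [{low:.4f}, {high:.4f}]")
```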
With the bootstrap comes the interesting question of what we should be sampling from our dataset. Should we randomly sample from the games in the dataset or the plays in the dataset (or both)? Instead of choosing an answer, we'll implement both a "regular" bootstrap that samples plays and a "cluster" bootstrap that samples games (the "both" option would sample games and then sample plays from each game, but the computations are already pretty cumbersome, so we'll skip this one).
We take 1000 regular bootstrap samples and print the results for the variables of interest.
def bootstrap_sample(PBP_df, numeric_vars, boolean_vars, dummy_vars, n_samples, seed=1):
np.random.seed(seed)
total_vars = ['offense', 'defense'] + numeric_vars + boolean_vars + dummy_vars
if len(dummy_vars) > 0:
dummy_vars_count = len(pd.get_dummies(PBP_df[dummy_vars], columns=dummy_vars, sparse=True).columns)
else:
dummy_vars_count = 0
interesting_var_count = len(numeric_vars) + len(boolean_vars) + dummy_vars_count
# get dummies
dummies_df = pd.get_dummies(PBP_df[total_vars], columns=dummy_vars+['offense', 'defense'], sparse=True)
return_vars = list(dummies_df.columns[:interesting_var_count]) + ['Intercept']
# convert dummies_df to csr
dummies_df[boolean_vars] = dummies_df[boolean_vars].astype(pd.SparseDtype("bool", False))
dummies_df[numeric_vars] = dummies_df[numeric_vars].astype(pd.SparseDtype("int8", 0))
arr = dataframe_to_csr(dummies_df, numeric_vars)
y = PBP_df['ppa']
size = len(PBP_df)
indices = [i for i in range(size)]
bootstrap_coefs = []
for _ in range(n_samples):
# sample plays
bootstrap_indices = np.random.choice(indices, size=size, replace=True)
arr_sample = arr[bootstrap_indices]
y_sample = y[bootstrap_indices]
# fit model
reg = linear_model.LinearRegression(fit_intercept=True)
reg.fit(X=arr_sample, y=y_sample)
# store interesting coefficients
bootstrap_coefs.append(list(reg.coef_[:interesting_var_count])+[reg.intercept_])
bootstrap_coefs = np.array(bootstrap_coefs) # get numpy array
bootstrap_info_df = get_bootstrap_info_df(bootstrap_coefs, return_vars)
return bootstrap_info_df
def get_bootstrap_info_df(bootstrap_coefs, return_vars):
# helper function to extract important statistics from bootstrap coefficient estimates
bootstrap_info = []
for i in range(bootstrap_coefs.shape[1]):
bootstrap_coefs_i = bootstrap_coefs[:, i]
col_i = return_vars[i]
mean = np.mean(bootstrap_coefs_i)
low_quantile = np.quantile(bootstrap_coefs_i, 0.025)
high_quantile = np.quantile(bootstrap_coefs_i, 0.975)
bootstrap_info.append([col_i, round(mean, 4), [round(low_quantile, 4), round(high_quantile, 4)]])
return pd.DataFrame(bootstrap_info, columns=['Variable', 'Bootstrap mean', '95% bootstrap interval'])
bootstrap_info = bootstrap_sample(PBP_df, numeric_vars, boolean_vars, dummy_vars, n_samples=1000)
bootstrap_info
| | Variable | Bootstrap mean | 95% bootstrap interval |
|---|---|---|---|
| 0 | hfa | 0.0108 | [0.0088, 0.013] |
| 1 | defense_GES_x_rushing_off | -0.0170 | [-0.0259, -0.0075] |
| 2 | rushing_team_def_x_rushing_off | -0.0882 | [-0.1331, -0.0437] |
| 3 | Intercept | 0.1233 | [0.1213, 0.1254] |
For Defense GES x Rushing Offense, the 95% bootstrap confidence interval is [-0.0259, -0.0075], and for Rushing Team Defense x Rushing Offense, the 95% bootstrap confidence interval is [-0.1331, -0.0437].
Next, we take 1000 cluster bootstrap samples and print the results for the variables of interest.
def cluster_bootstrap_sample(PBP_df, numeric_vars, boolean_vars, dummy_vars, n_samples, seed=1):
np.random.seed(seed)
PBP_df['idx'] = PBP_df.index
# interesting (non-team) variables with counts
total_vars = ['offense', 'defense'] + numeric_vars + boolean_vars + dummy_vars
if len(dummy_vars) > 0:
dummy_vars_count = len(pd.get_dummies(PBP_df[dummy_vars], columns=dummy_vars, sparse=True).columns)
else:
dummy_vars_count = 0
interesting_var_count = len(numeric_vars) + len(boolean_vars) + dummy_vars_count
# get dummies
dummies_df = pd.get_dummies(PBP_df[total_vars], columns=dummy_vars+['offense', 'defense'], sparse=True)
return_vars = list(dummies_df.columns[:interesting_var_count]) + ['Intercept']
# convert dummies_df to csr
dummies_df[boolean_vars] = dummies_df[boolean_vars].astype(pd.SparseDtype("bool", False))
dummies_df[numeric_vars] = dummies_df[numeric_vars].astype(pd.SparseDtype("int8", 0))
arr = dataframe_to_csr(dummies_df, numeric_vars)
# Some games have very few plays in the dataset due to data errors somewhere along the line
# We sample from games with at least 30 plays in the dataset
game_counts = PBP_df['game_id'].value_counts()
high_count_games = game_counts[game_counts>=30].index
n_games = len(high_count_games)
bootstrap_coefs = []
for _ in range(n_samples):
# sample games
game_sample = np.random.choice(high_count_games, size=n_games, replace=True)
sample_id_df = pd.DataFrame({'game_id': game_sample})
# get play data for sampled games
PBP_df_sample = pd.merge(sample_id_df, PBP_df, how='left', on='game_id')
y_sample = PBP_df_sample['ppa']
# get play data in sparse representation
sample_idx = PBP_df.set_index('id').loc[PBP_df_sample['id'], 'idx'].values
arr_sample = arr[sample_idx]
# fit model
reg = linear_model.LinearRegression(fit_intercept=True)
reg.fit(X=arr_sample, y=y_sample)
# store interesting coefficients
bootstrap_coefs.append(list(reg.coef_[:interesting_var_count])+[reg.intercept_])
bootstrap_coefs = np.array(bootstrap_coefs) # convert to numpy array
bootstrap_info_df = get_bootstrap_info_df(bootstrap_coefs, return_vars)
return bootstrap_info_df
cluster_bootstrap_info = cluster_bootstrap_sample(PBP_df, numeric_vars, boolean_vars, dummy_vars, n_samples=1000)
cluster_bootstrap_info
| | Variable | Bootstrap mean | 95% bootstrap interval |
|---|---|---|---|
| 0 | hfa | 0.0108 | [0.009, 0.0126] |
| 1 | defense_GES_x_rushing_off | -0.0171 | [-0.0268, -0.0065] |
| 2 | rushing_team_def_x_rushing_off | -0.0891 | [-0.1399, -0.0425] |
| 3 | Intercept | 0.1232 | [0.1214, 0.1251] |
For Defense GES x Rushing Offense, the 95% cluster bootstrap confidence interval is [-0.0268, -0.0065], and for Rushing Team Defense x Rushing Offense, the 95% cluster bootstrap confidence interval is [-0.1399, -0.0425]. These are slightly wider than the confidence intervals from the regular bootstrap, but all of them stop well short of zero. The negative coefficient estimate results seem to be robust given the model specification.
We started with the question of whether prior experience with Option Offenses improves a defense’s performance. We refined that to the question of whether Game or Own-Team experience with Rushing Offenses improves a defense's PPA allowed. We constructed a linear regression model that gives an affirmative answer for both types of experience.
Regression models of complex processes generally come with the potential for estimation bias introduced by model misspecification. With only 5 independent variables, our model is likely too simple to capture every important factor affecting PPA. Omitting variables that are correlated with both the regressors and the dependent variable can bias the coefficient estimates. Our analysis faces the risk of a false positive if we omitted some variable(s) correlated with both experience and PPA whose omission biased the coefficient estimates downward.
While we can't make this concern go away fully, we can at least see what happens if we add some unused variables from the dataset to the model.
Periods range from 0 to 11, with period 0 likely reflecting data errors. We group all non-regulation periods (those outside 1-4) together with period 5 into a general "overtime" category. This eliminates sparsely populated categories when we create dummy variables.
pd.DataFrame(PBP_df['period'].value_counts())
| period | count |
|---|---|
| 2 | 420118 |
| 4 | 396740 |
| 1 | 382651 |
| 3 | 378189 |
| 5 | 4515 |
| 6 | 436 |
| 7 | 99 |
| 8 | 42 |
| 0 | 36 |
| 9 | 8 |
| 10 | 5 |
| 11 | 5 |
PBP_df['period_trunc'] = PBP_df['period']
PBP_df.loc[~PBP_df['period'].isin([1,2,3,4,5]), 'period_trunc'] = 5
Below, we see that mean PPA varies with the period: it's lower in overtime than in the 4th quarter, and lower in the 4th quarter than in the first three quarters. There also appears to be an interaction effect with Rushing Offenses, which outperform other offenses in regulation but not in overtime.
PBP_df.groupby(['period_trunc','rushing_off'])[['ppa']].mean()
| period_trunc | rushing_off | ppa |
|---|---|---|
| 1 | 0 | 0.150206 |
| 1 | 1 | 0.210621 |
| 2 | 0 | 0.130158 |
| 2 | 1 | 0.181691 |
| 3 | 0 | 0.143942 |
| 3 | 1 | 0.211982 |
| 4 | 0 | 0.079226 |
| 4 | 1 | 0.113120 |
| 5 | 0 | -0.059520 |
| 5 | 1 | -0.090001 |
Mean PPA also varies with the down, with 3rd downs having the highest mean PPA and 1st downs the lowest. There again appears to be an interaction with Rushing Offenses, which slightly outperform other offenses on 1st through 3rd downs but greatly outperform them on 4th down. This is unsurprising, since most 4th-down plays are rushes.
PBP_df.groupby(['down','rushing_off'])[['ppa']].mean()
| down | rushing_off | ppa |
|---|---|---|
| 1 | 0 | -0.005040 |
| 1 | 1 | 0.042829 |
| 2 | 0 | 0.105529 |
| 2 | 1 | 0.142021 |
| 3 | 0 | 0.439912 |
| 3 | 1 | 0.472730 |
| 4 | 0 | 0.091681 |
| 4 | 1 | 0.437262 |
PPA increases with the number of yards to the goal line at the start of the play. The association is slightly weaker for Rushing Offenses, but probably not by enough to warrant an interaction term in the model.
np.corrcoef(PBP_df.loc[PBP_df['rushing_off']==0, 'yards_to_goal'], PBP_df.loc[PBP_df['rushing_off']==0, 'ppa'])
array([[1. , 0.08941935],
[0.08941935, 1. ]])
np.corrcoef(PBP_df.loc[PBP_df['rushing_off']==1, 'yards_to_goal'], PBP_df.loc[PBP_df['rushing_off']==1, 'ppa'])
array([[1. , 0.07240664],
[0.07240664, 1. ]])
We'll add terms for the following variables to the model: yards_to_goal, dummy variables for down and period_trunc, and two interaction terms, down_4_x_rushing_off and period_trunc_5_x_rushing_off.
To control for a potential confounder, we'll also add defense_GES, which gives the defense's Game Experience Score regardless of whether they're facing a Rushing Offense.
PBP_df['down_4_x_rushing_off'] = (PBP_df['down']==4)*PBP_df['rushing_off'] # new
PBP_df['period_trunc_5_x_rushing_off'] = (PBP_df['period_trunc']==5)*PBP_df['rushing_off'] # new
numeric_vars1 = ['hfa', 'defense_GES', 'defense_GES_x_rushing_off', 'yards_to_goal']
boolean_vars1 = ['rushing_team_def_x_rushing_off', 'down_4_x_rushing_off', 'period_trunc_5_x_rushing_off']
dummy_vars1 = ['down', 'period_trunc']
total_vars1 = ['offense', 'defense'] + numeric_vars1 + boolean_vars1 + dummy_vars1
df1 = PBP_df[total_vars1 + ['ppa']]
dummies_df1 = pd.get_dummies(df1[total_vars1], columns=['offense', 'defense']+dummy_vars1, sparse=True)
dummies_df1
| | hfa | defense_GES | defense_GES_x_rushing_off | yards_to_goal | rushing_team_def_x_rushing_off | down_4_x_rushing_off | period_trunc_5_x_rushing_off | offense_Air Force2006 | offense_Air Force2007 | offense_Air Force2008 | ... | defense_Wyoming2023 | down_1 | down_2 | down_3 | down_4 | period_trunc_1 | period_trunc_2 | period_trunc_3 | period_trunc_4 | period_trunc_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 0 | 80 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | True | False | False | False | False |
| 1 | -1 | 1 | 0 | 44 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | True | False | False | False | False |
| 2 | 1 | 1 | 0 | 43 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | True | False | False | False | False |
| 3 | 1 | 1 | 0 | 57 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | True | False | False | False | False |
| 4 | 1 | 1 | 0 | 39 | 0 | 0 | 0 | False | False | False | ... | False | False | True | False | False | True | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1582839 | 0 | 5 | 0 | 83 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 1582840 | 0 | 5 | 0 | 94 | 0 | 0 | 0 | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 1582841 | 0 | 5 | 0 | 94 | 0 | 0 | 0 | False | False | False | ... | False | False | True | False | False | False | False | False | True | False |
| 1582842 | 0 | 5 | 0 | 94 | 0 | 0 | 0 | False | False | False | ... | False | False | False | True | False | False | False | False | True | False |
| 1582843 | 0 | 5 | 0 | 98 | 0 | 0 | 0 | False | False | False | ... | False | False | False | False | True | False | False | False | True | False |
1582844 rows × 4288 columns
We again need to convert the model inputs to scipy's CSR format for sparse matrices.
dummies_df1[numeric_vars1] = dummies_df1[numeric_vars1].astype(pd.SparseDtype("int8", 0))
dummies_df1[boolean_vars1] = dummies_df1[boolean_vars1].astype(pd.SparseDtype("bool", False))
arr1 = dataframe_to_csr(dummies_df1, numeric_vars1)
We train the new model and display the important parameter estimates.
reg1 = linear_model.LinearRegression(fit_intercept=True).fit(X=arr1, y=df1['ppa'])
# Extract regression coefficients
reg_results_df1 = pd.DataFrame({
'coef_name': dummies_df1.columns.values,
'reg_coef': reg1.coef_})
reg_results_df1.loc[len(reg_results_df1.index)] = ['Intercept', reg1.intercept_]
reg_results_df1.loc[reg_results_df1['coef_name'].isin([
'hfa',
'defense_GES', 'defense_GES_x_rushing_off',
'rushing_team_def_x_rushing_off',
'yards_to_goal',
'down_1','down_2','down_3','down_4',
'down_4_x_rushing_off',
'period_trunc_5_x_rushing_off',
'period_trunc_1','period_trunc_2','period_trunc_3','period_trunc_4','period_trunc_5',
'Intercept'])].round(4)
| | coef_name | reg_coef |
|---|---|---|
| 0 | hfa | 0.0143 |
| 1 | defense_GES | 0.0165 |
| 2 | defense_GES_x_rushing_off | -0.0243 |
| 3 | yards_to_goal | 0.0056 |
| 4 | rushing_team_def_x_rushing_off | -0.1009 |
| 5 | down_4_x_rushing_off | 0.2707 |
| 6 | period_trunc_5_x_rushing_off | -0.0748 |
| 4279 | down_1 | -0.2050 |
| 4280 | down_2 | -0.0787 |
| 4281 | down_3 | 0.2710 |
| 4282 | down_4 | 0.0127 |
| 4283 | period_trunc_1 | 0.0156 |
| 4284 | period_trunc_2 | 0.0050 |
| 4285 | period_trunc_3 | 0.0149 |
| 4286 | period_trunc_4 | -0.0395 |
| 4287 | period_trunc_5 | 0.0040 |
| 4288 | Intercept | -0.1180 |
Moving to the more complex model has actually strengthened the key experience effects: the coefficient on Defense GES x Rushing Offense moved from -0.0171 to -0.0243, and the coefficient on Own-Team Experience moved from -0.0879 to -0.1009. Over the course of an entire game (~65 plays), these amount to a decrease of about 1.58 PPA for each additional GES and 6.56 PPA for Own-Team Experience.
We'll look at regular bootstrap confidence intervals (1000 samples, sampling plays) for the new model in case the added variables decrease the efficiency of the parameter estimates.
bootstrap_info1 = bootstrap_sample(PBP_df, numeric_vars1, boolean_vars1, dummy_vars1, n_samples=1000)
bootstrap_info1
| | Variable | Bootstrap mean | 95% bootstrap interval |
|---|---|---|---|
| 0 | hfa | 0.0142 | [0.0121, 0.0163] |
| 1 | defense_GES | 0.0165 | [0.0135, 0.0193] |
| 2 | defense_GES_x_rushing_off | -0.0245 | [-0.0336, -0.0154] |
| 3 | yards_to_goal | 0.0056 | [0.0055, 0.0057] |
| 4 | rushing_team_def_x_rushing_off | -0.0988 | [-0.1409, -0.0553] |
| 5 | down_4_x_rushing_off | 0.2787 | [0.1742, 0.392] |
| 6 | period_trunc_5_x_rushing_off | -0.0821 | [-0.2294, 0.0645] |
| 7 | down_1 | -0.2050 | [-0.2116, -0.1981] |
| 8 | down_2 | -0.0787 | [-0.0854, -0.0725] |
| 9 | down_3 | 0.2710 | [0.2638, 0.2782] |
| 10 | down_4 | 0.0127 | [-0.0062, 0.0314] |
| 11 | period_trunc_1 | 0.0154 | [0.0084, 0.0231] |
| 12 | period_trunc_2 | 0.0049 | [-0.0024, 0.012] |
| 13 | period_trunc_3 | 0.0149 | [0.0075, 0.0221] |
| 14 | period_trunc_4 | -0.0396 | [-0.0471, -0.032] |
| 15 | period_trunc_5 | 0.0044 | [-0.0218, 0.0314] |
| 16 | Intercept | -0.1179 | [-0.1282, -0.1075] |
For Defense GES x Rushing Offense, the 95% bootstrap confidence interval is [-0.0336, -0.0154], and for Rushing Team Defense x Rushing Offense, the 95% bootstrap confidence interval is [-0.1409, -0.0553].
Next, we'll look at confidence intervals from 1000 cluster bootstrap samples (sampling games).
cluster_bootstrap_info1 = cluster_bootstrap_sample(PBP_df, numeric_vars1, boolean_vars1, dummy_vars1, n_samples=1000)
cluster_bootstrap_info1
| | Variable | Bootstrap mean | 95% bootstrap interval |
|---|---|---|---|
| 0 | hfa | 0.0144 | [0.0123, 0.0163] |
| 1 | defense_GES | 0.0163 | [0.0129, 0.0194] |
| 2 | defense_GES_x_rushing_off | -0.0239 | [-0.0354, -0.0122] |
| 3 | yards_to_goal | 0.0057 | [0.0056, 0.0058] |
| 4 | rushing_team_def_x_rushing_off | -0.1014 | [-0.1606, -0.048] |
| 5 | down_4_x_rushing_off | 0.2791 | [0.17, 0.3909] |
| 6 | period_trunc_5_x_rushing_off | -0.0744 | [-0.2186, 0.0851] |
| 7 | down_1 | -0.2071 | [-0.2139, -0.2005] |
| 8 | down_2 | -0.0798 | [-0.0866, -0.073] |
| 9 | down_3 | 0.2715 | [0.2643, 0.2788] |
| 10 | down_4 | 0.0153 | [-0.0042, 0.0348] |
| 11 | period_trunc_1 | 0.0152 | [0.0076, 0.0229] |
| 12 | period_trunc_2 | 0.0048 | [-0.0025, 0.0125] |
| 13 | period_trunc_3 | 0.0148 | [0.0069, 0.0221] |
| 14 | period_trunc_4 | -0.0396 | [-0.0471, -0.0318] |
| 15 | period_trunc_5 | 0.0049 | [-0.0223, 0.0314] |
| 16 | Intercept | -0.1215 | [-0.132, -0.1106] |
For Defense GES x Rushing Offense, the 95% cluster bootstrap confidence interval is [-0.0354, -0.0122], and for Rushing Team Defense x Rushing Offense, the 95% cluster bootstrap confidence interval is [-0.1606, -0.048]. Again, the cluster bootstrap confidence intervals are wider than the regular bootstrap intervals, but not by enough that they include zero.
As discussed in the previous section, one way to explore this topic further is to check whether the result still holds if other relevant variables are found and added to the model.
We could look at more sophisticated measures of experience. The GES as defined in this project weights games from two seasons ago as heavily as it weights games from the last or current season. We could develop a metric that weights recent experience more heavily. The GES also doesn't take coaching changes or player turnover into account. If a school hires a new coaching staff that brings in a bunch of transfer players, the school's experience against Rushing Offenses from previous years seems much less relevant.
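As one hedged illustration of what a recency-weighted metric could look like (the function name, input format, and decay rate below are all hypothetical, not part of this project's data):

```python
def weighted_ges(games_by_seasons_ago, decay=0.5):
    """Hypothetical recency-weighted Game Experience Score.

    games_by_seasons_ago[k] holds the number of games against Rushing
    Offenses played k seasons ago; older games are down-weighted by decay**k.
    """
    return sum(n * decay**k for k, n in enumerate(games_by_seasons_ago))

# 2 games this season, 1 last season, 3 two seasons ago
print(weighted_ges([2, 1, 3]))  # 2*1.0 + 1*0.5 + 3*0.25 = 3.25
```

Under this scheme, a game from two seasons ago counts only a quarter as much as one from the current season, rather than equally as in the GES used above.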
Relatedly, it would be interesting to check whether the effects of different experience metrics have changed over time. The recent advent of the transfer portal has led to more roster turnover, which seems like it would weaken the effect of the GES (and maybe Own-Team Experience too).
It's possible that some of the relationships between the independent variables and PPA are non-linear. We'll look at a bar plot of the most recent model errors by Defense GES x Rushing Offense (truncated to group values $\geq 5$ together).
df1['defense_GES_x_rushing_off_trunc'] = df1['defense_GES_x_rushing_off']
df1.loc[df1['defense_GES_x_rushing_off_trunc'] > 5, 'defense_GES_x_rushing_off_trunc'] = 5
y_hat1 = reg1.predict(arr1)
res1 = df1['ppa'] - y_hat1
df1['y_hat'] = y_hat1
df1['res'] = res1
sns.barplot(df1, x='defense_GES_x_rushing_off_trunc', y='res', capsize=0.1)
plt.title('Mean error bar chart')
plt.ylabel('Error')
plt.show()
plt.close()
There's some evidence of non-linearity here. The mean error for GES 1-3 is positive while the mean error for GES 4+ is negative. A negative error means that the model overestimates the true PPA, which means that the model underestimates the performance of the defense. Since the error for GES 4+ is negative, it's possible that the model underestimates how much a defense benefits from having lots of recent Game Experience against Rushing Offenses.
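If we wanted to pursue this, one option would be to replace the linear GES term with bucketed dummy variables, letting the model fit a separate effect per range instead of a single slope. A minimal sketch, with illustrative GES values and bucket edges that are assumptions rather than choices made in this project:

```python
import numpy as np
import pandas as pd

# Illustrative GES values, not taken from the dataset
ges = pd.Series([0, 1, 2, 3, 4, 6, 9], name='ges')

# Bucket into 0, 1-3, and 4+ so each range gets its own coefficient
buckets = pd.cut(ges, bins=[-1, 0, 3, np.inf], labels=['0', '1-3', '4+'])
ges_dummies = pd.get_dummies(buckets, prefix='ges')
print(ges_dummies.columns.tolist())  # ['ges_0', 'ges_1-3', 'ges_4+']
```

These dummy columns could then be interacted with rushing_off and appended to the design matrix like the other boolean variables above.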
[1] https://web3.ncaa.org/directory/memberList?type=12&division=I-FBS&sportCode=MFB
[2] http://fs.ncaa.org/Docs/AMA/Division%20I%20Forms/2010-11%20FBS%20Forms/Football%20Bowl%20Subqa%2012%208%2010.pdf
[3] https://www.reddit.com/r/MapPorn/comments/prwu15/ncaa_football_team_map_updated_2021/
[4] https://www.defense.gov/News/Releases/Release/Article/2683495/statement-by-secretary-of-defense-lloyd-j-austin-iii-on-the-nfl-waiver-for-came/
[5] https://nflbreakdowns.com/2015/08/the-flexbone-option-offense-introduction/
[6] https://247sports.com/college/smu/Article/Army-Navy-game-2021-triple-option-what-makes-it-difficult-to-defend-177772816/
[7] https://collegefootballdata.com/glossary
[8] https://blog.collegefootballdata.com/opponent-adjusted-stats-ridge-regression/
[9] https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
[10] https://www.teamrankings.com/college-football/stat/plays-per-game